This article proposes a new index for determining the number of clusters in a data set. The index is based on repeatedly dividing the data set (or a part of it) into two clusters and checking whether each division should be retained or rejected. The checking criterion is the Rand index, which measures the extent to which two divisions coincide: the primary division and a second division of a narrower subset consisting of the smaller cluster from the primary division and 1/3 of the bigger cluster. The divisions are made by classical k-means (with k = 2) using multiple random choices of starting points. The efficiency of the new index was examined in a broad experiment on a few thousand data sets generated to possess cluster structures with different numbers of variables, clusters, and cluster densities, and with different variants of within-cluster correlation. The cluster overlap, controlled according to the OCLUS algorithm, was also varied. Efficiency was assessed by comparison with two other leading indices, the Caliński-Harabasz index and the Gap index. The new index is more efficient than both competitors when the cluster structure is not very distinct.
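The single retain-or-reject check described above can be illustrated with a minimal sketch. It is not the author's implementation: the function names (`two_means`, `rand_index`, `division_agreement`) are hypothetical, the number of restarts is arbitrary, and the abstract does not say which third of the bigger cluster enters the narrower subset, so the sketch assumes the third lying nearest to the centroid of the smaller cluster. The abstract also gives no threshold for retaining a division, so the sketch only returns the Rand agreement.

```python
import random
from itertools import combinations

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def two_means(points, restarts=10, iters=50, rng=None):
    """Classical k-means with k = 2 and several random starts.
    Returns the 0/1 labels of the start with the lowest within-cluster cost."""
    rng = rng or random.Random(0)
    best_labels, best_cost = None, float("inf")
    for _ in range(restarts):
        centers = rng.sample(points, 2)  # random choice of starting points
        labels = None
        for _ in range(iters):
            new_labels = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                          for p in points]
            if new_labels == labels:  # assignments stable: converged
                break
            labels = new_labels
            for k in (0, 1):
                members = [p for p, l in zip(points, labels) if l == k]
                if members:
                    centers[k] = mean(members)
        cost = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels

def rand_index(a, b):
    """Share of point pairs on which two partitions agree (same/different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def division_agreement(points, rng=None):
    """Rand agreement between the primary 2-means division and a second
    division of: smaller cluster + 1/3 of the bigger cluster."""
    rng = rng or random.Random(0)
    primary = two_means(points, rng=rng)
    small_label = 0 if primary.count(0) <= primary.count(1) else 1
    small = [i for i, l in enumerate(primary) if l == small_label]
    big = [i for i, l in enumerate(primary) if l != small_label]
    if not small or not big:  # degenerate division, nothing to compare
        return 0.0
    # ASSUMPTION (not stated in the abstract): use the third of the bigger
    # cluster that lies nearest to the smaller cluster's centroid.
    c_small = mean([points[i] for i in small])
    big.sort(key=lambda i: sq_dist(points[i], c_small))
    subset = small + big[: max(1, len(big) // 3)]
    second = two_means([points[i] for i in subset], rng=rng)
    return rand_index([primary[i] for i in subset], second)

# Usage: two well-separated groups of unequal size.
gen = random.Random(1)
data = [(gen.gauss(0, 0.5), gen.gauss(0, 0.5)) for _ in range(30)] + \
       [(gen.gauss(10, 0.5), gen.gauss(10, 0.5)) for _ in range(60)]
print(division_agreement(data))
```

An agreement close to 1 means the second division reproduces the primary one on the narrower subset, which speaks for retaining the division. A low value suggests that the primary division cut through a homogeneous group.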
cluster analysis, number of clusters in a data set, Caliński-Harabasz index, Gap index
Caliński T., Harabasz J., (1974), A Dendrite Method for Cluster Analysis, Communications in Statistics, 3, 1–27.
Gatnar E., Walesiak M., (eds.), (2004), Metody Statystycznej Analizy Wielowymiarowej w Badaniach Marketingowych, Wydawnictwo AE we Wrocławiu.
Korzeniewski J., (2005), Propozycja nowego algorytmu wyznaczającego liczbę skupień, Prace Naukowe AE we Wrocławiu nr 1076, Taksonomia 12, 257–265.
Korzeniewski J., (2012), Metody selekcji zmiennych w analizie skupień. Nowe procedury, Wydawnictwo Uniwersytetu Łódzkiego.
Migdał-Najman K., Najman K., (2005), Analityczne metody ustalania liczby skupień, Prace Naukowe AE we Wrocławiu nr 1076, Taksonomia 12, 265–273.
Milligan G. W., Cooper M., (1985), An Examination of Procedures for Determining the Number of Clusters in a Data Set, Psychometrika, 50 (2), 159–179.
Mojena R., (1977), Hierarchical Grouping Methods and Stopping Rules: an Evaluation, Computer Journal, 20 (4), 359–363.
Najman K., Migdał-Najman K., (2006), Wykorzystanie indeksu Silhouette do ustalania optymalnej liczby skupień, Wiadomości Statystyczne, 6, 1–10.
Sokołowski A., (1992), Empiryczne testy istotności w taksonomii, Zeszyty Naukowe AE w Krakowie, Seria specjalna: Monografie nr 108.
Steinley D., Henson R., (2005), OCLUS: An Analytic Method for Generating Clusters with Known Overlap, Journal of Classification, 22, 221–250.
Tibshirani R., Walther G., Hastie T., (2001), Estimating the Number of Clusters in a Dataset via the Gap Statistic, Journal of the Royal Statistical Society, Series B, 63, 411–423.
Wieczorkowski R., Zieliński R., (1997), Komputerowe generatory liczb losowych, Wydawnictwa Naukowo Techniczne, Warszawa.